TR-16-190  DISTRIBUTION  STATEMENT  A:  Approved  for  public  release;  distribution  is  unlimited. 


1  The  effects  of  signal  erosion  and  core  genome  reduction  on  the  identification  of 

2  diagnostic  markers 

3 

4  Jason  W.  Sahl3b,  Adam  J.  Vazquez3,  Carina  M.  Hall3,  Joseph  D.  Busch3,  Apichai 

5  Tuanyokc,  Mark  Mayo4,  James  M.  Schuppb,  Madeline  Lummis3,  Talima  Pearson3, 

6  Kenzie  Shippy3,  Rebecca  E.  Colmanb,  Christopher  J.  Allender3,  Vanessa 

7  Theobaldd,  Derek  S.  Sarovichd,  Erin  P.  Priced,  Alex  Hutcheson6,  Jonas  Korlach6, 

8  John  J.  LiPumaf,  Jason  Ladner9,  Sean  Lovett9,  Galina  Koroleva9,  Gustavo 

9  Palacios9,  Direk  Limmathurotsakulhl,  Vanaporn  Wuthiekanunh,  Gumphol 

10  Wongsuwanh,  Bart  J.  Curried,  Paul  Keim3b,  David  M.  Wagner3# 

11 

12  3Center  for  Microbial  Genetics  and  Genomics,  Northern  Arizona  University, 

13  Flagstaff,  AZ;  bTranslational  Genomics  Research  Institute,  Flagstaff,  AZ; 

14  cEmerging  Pathogens  Institute,  University  of  Florida,  Gainesville,  FL;  dGlobal  and 

15  Tropical  Health  Division,  Menzies  School  of  Health  Research,  Darwin,  NT, 

16  Australia;  ePacific  Biosciences;  division  of  Pediatric  Infectious  Diseases, 

17  University  of  Michigan;  9Center  for  Genome  Sciences,  USAMRIID,  Fort  Detrick, 

18  MD;  hMahidol-Oxford  Tropical  Medicine  Research  Unit;  'Department  of  Tropical 

19  Hygiene,  Faculty  of  Tropical  Medicine,  Mahidol  University,  Bangkok 

20 

21  Running  head:  Core  genome  decay  and  signal  erosion  effects  on  diagnostics 

22 

23  #Address  correspondence  to  David  Wagner,  dave.wagner@nau.edu 

24 

25 

26 

27 

28 

29 

30 


1 


UNCLASSIFIED 


TR-16-190  DISTRIBUTION  STATEMENT  A:  Approved  for  public  release;  distribution  is  unlimited. 


31 

32 

33 

34 

35 

36 

37 

38 

39 

40 

41 

42 

43 

44 

45 

46 

47 

48 

49 

50 

51 

52 

53 

54 

55 

56 

57 

58 

59 

60 
61 


Abstract:  Whole  genome  sequence  (WGS)  data  are  commonly  used  to  design 
diagnostics  for  the  identification  of  bacterial  pathogens.  To  do  this  effectively, 
genomics  databases  must  be  comprehensive  to  identify  the  strict  core  genome 
that  is  specific  to  the  target  pathogen.  As  additional  genomes  are  analyzed,  the 
core  genome  size  is  reduced  and  there  is  erosion  of  the  target-specific  regions 
due  to  commonality  with  related  species,  potentially  resulting  in  the  identification 
of  false  positives  and/or  false  negatives. 

Importance:  A  comparative  analysis  of  1,130  Burkholderia  genomes  identified 
unique  markers  for  many  named  species,  including  the  human  pathogens  B. 
pseudomallei  and  B.  mallei.  Due  to  core  genome  reduction  and  signature 
erosion,  only  38  targets  specific  to  B.  pseudomallei/mallei  were  identified.  By 
using  only  public  genomes,  a  larger  number  of  markers  were  identified  due  to 
undersampling  and  represent  potential  false  positives.  This  analysis  has 
implications  for  the  design  of  diagnostics  for  other  species  where  the  genomic 
space  of  the  target  and/or  closely  related  species  are  not  well  defined. 
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62  Introduction 

63  Whole  genome  sequence  (WGS)  data  are  routinely  used  to  develop  DNA- 

64  based  diagnostics  for  rapid  and  accurate  identification  of  clinical  pathogens  (1, 

65  2).  Validating  the  specificity  of  diagnostic  targets  ensures  that  assays  do  not 

66  produce  false  positives  (identifying  a  non-pathogen  as  a  pathogen)  or  false 

67  negatives  (not  identifying  a  pathogen  that  is  actually  present).  To  avoid  false 

68  positives  and  negatives,  DNA-based  diagnostics  must  be  conserved  across  the 

69  target  species  and  absent  from  non-target  species. 

70  Two  critical  issues  arise  during  the  process  of  identifying  specific  diagnostics 

71  from  bacterial  genomes.  First,  the  number  of  genes  in  the  core  genome  (i.e. 

72  genes  present  in  every  individual  of  a  species)  tends  to  reduce  as  the  number  of 

73  sequenced  genomes  increases  (3,  4).  Certain  pathogens  (e.g.  Yersinia  pestis) 

74  propagate  clonally,  are  highly  homogeneous,  and  show  little  variation  in  core 

75  genome  size  with  additional  sampling  (5).  In  this  case,  the  core  genome  size  is 

76  not  expected  to  drastically  reduce  as  more  genomes  are  analyzed.  In  contrast, 

77  the  core  genome  size  of  Burkholderia  pseudomallei  reduces  significantly  with 

78  each  new  genome  added  (6).  A  second  issue  arises  from  genomes  of  related 

79  species,  or  “near  neighbors”,  that  share  core  genes  with  the  target  species.  In  a 

80  process  of  signature  erosion,  this  genomic  overlap  often  increases  as  near 

81  neighbor  genomes  are  added  to  the  analysis  and  erodes  the  number  of  potential 

82  diagnostic  targets.  Unfortunately,  near  neighbors  are  often  under-sampled  (or  not 

83  sampled  at  all)  during  the  search  for  diagnostic  targets,  which  hinders  efforts  to 

84  identify  species-specific  targets. 

85  Burkholderia  represents  a  model  genus  for  the  demonstration  of  core  genome 

86  reduction  and  signal  erosion.  The  Burkholderia  genus  contains  a  diverse  set  of 

87  species,  including  plant  pathogens  (7)  and  human  pathogens,  such  as  B. 

88  pseudomallei,  the  causative  agent  of  melioidosis  (8),  and  B.  mallei,  the  causative 

89  agent  of  glanders  (9).  The  pseudomallei  group  includes  B.  pseudomallei,  B. 

90  mallei,  B.  oklahomensis,  B.  thailandensis,  and  the  newly  described  B. 

91  humptydooensis  (10).  The  B.  cepacia  complex  (Bcc)  is  a  diverse  group  within 

92  Burkholderia  that  is  associated  with  opportunistic  infections  and  is  comprised  of 
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93  at  least  20  genomic  species  (11,  12).  Most  of  the  relationships  between  these 

94  species  have  been  determined  through  gene  marker  analyses,  such  as  the  recA 

95  gene  (13,  14)  or  multilocus  sequence  typing  (15). 

96  From  a  genomics  perspective,  Burkholderia  whole  genome  sequencing 

97  efforts  have  focused  on  B.  pseudomallei  (16)  and  B.  mallei  (17).  Recent  studies 

98  have  begun  to  sequence  other  Burkholderia  spp.,  including  members  of  the  Bcc 

99  (18).  However,  large-scale,  whole  genome,  phylogenetics-based  studies  that 

100  define  the  overall  phylogenetic  structure  among  Burkholderia  species  using  high- 

101  resolution  methods  are  currently  lacking. 

102  In  this  study,  we  extensively  surveyed  the  environment  in  Australia,  the 

103  United  States,  and  Southeast  Asia  for  Burkholderia  spp.  We  sequenced  a  large 

104  collection  of  genomes  to:  1)  explore  the  genomic  diversity  of  Burkholderia  spp. 

105  that  grow  on  Ashdown’s  agar;  2)  identify  specific  diagnostic  markers  for  B. 

106  pseudomallei  and  B.  mallei,  and;  3)  understand  the  sampling  effect  of  core 

107  genome  size  reduction  and  signal  erosion  on  the  selection  of  highly  specific 

108  diagnostic  targets. 

109 

110  Results 

111 

112  Whole  genome  sequencing  of  Burkholderia  spp.  In  this  study,  we  analyzed 

113  the  whole  genome  sequences  of  829  Burkholderia  spp.  that  grow  on  Ashdown’s 

114  agar  (Table  1),  a  selective  medium  containing  the  aminoglycoside,  gentamicin. 

115  These  isolates  were  collected  from  diverse  geographic  locations  in  the  United 

116  States,  Thailand,  and  Australia  (Supplemental  Table  1).  To  understand  the 

117  effects  of  core  genome  reduction  and  signature  erosion  on  the  identification  of 

118  highly  specific  diagnostic  targets,  the  genomes  of  256  diverse  B. 

119  pseudomallei/mallei  strains  were  sequenced,  assembled,  and  deposited  in  public 

120  databases  (Supplemental  Table  1);  these  genomes  were  combined  with  160  B. 

121  pseudomalleilmallei  genome  assemblies  already  in  public  databases.  Most  of  the 

122  genomes  (n=779)  in  this  study  were  sequenced  on  the  lllumina  platform,  with  50 
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123  genomes  also  sequenced  on  the  PacBio  platform,  which  generated  highly 

124  contiguous,  and  often  finished,  assemblies  (Supplemental  Table  1). 

125 

126  Core  genome  SNP  phylogeny.  To  understand  the  phylogenetic  structure  of  the 

127  Burkholderia  genus,  genomes  sequenced  in  this  study  as  well  as  GenBank 

128  reference  genomes  (Supplemental  Table  2)  were  aligned  against  B. 

129  pseudomallei  K96243  (19)  with  NUCmer  (20)  and  SNPs  were  identified  with 

130  NASP.  The  maximum  likelihood  phylogeny  inferred  from  core,  orthologous  SNPs 

131  (n=1 05,877)  demonstrated  that  all  genomes  sequenced  in  this  study,  with  the 

132  exception  of  1  B.  gladioli  genome,  grouped  in  either  the  Burkholderia  cepacia 

133  complex  ( Bcc )  or  the  B.  pseudomallei  group  (Figure  1).  Based  on  the 

134  monophyletic  nature  and  complexity  of  the  latter  clade,  we  propose  to  name  it  the 

135  B.  pseudomallei  complex  (Bpc).  Multiple  additional  Burkholderia  genomes  from 

136  GenBank  were  analyzed  and  were  found  to  be  more  distantly  related  to  these 

137  two  groups.  Our  clade  naming  scheme  is  consistent  with  a  recently  published 

138  taxonomic  scheme  for  Burkholderia  (21).  As  such,  they  were  not  examined  in 

139  detail  in  this  study  but  were  included  for  marker  screening  purposes 

140  (“Paraburkholderia”  genomes  in  Supplemental  Table  2). 

141  The  work  performed  in  this  study  greatly  expands  the  known  genomic 

142  diversity  of  Burkholderia.  For  example,  at  the  onset  of  this  study,  only  two  B. 

143  ubonensis  genome  assemblies  were  available  in  Genbank.  This  is  likely  due  to 

144  the  fact  that  most  genome  sequencing  has  focused  on  clinically-relevant 

145  organisms,  whereas  we  sampled  environmental  as  well  as  clinical  isolates.  This 

146  study  adds  the  genomes  of  254  B.  ubonensis  isolates,  including  three  finished 

147  genomes  (3  contigs)  and  three  nearly  finished  genomes  (4-5  contigs) 

148  (Supplemental  Table  1 ).  All  of  these  genomes  are  publicly  available  and  will  help 

149  provide  phylogenetic  context  for  additional  Burkholderia  genomes  that  are 

150  sequenced,  including  from  clinical  isolates.  We  have  also  generated  the  first 

151  WGS  sequences  for  other  recently  described  species,  such  as  B.  stagnalis  and 

152  B.  territorii  (12),  including  completed  genomes,  which  will  provide  data  for 

153  additional  comparative  studies. 
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154 

155  Comparative  genomics.  Based  on  the  topology  of  the  core  genome  phylogeny 

156  (Figure  1),  pan-genome  statistics  were  calculated  for  each  major  clade  (Table  2) 

157  using  LS-BSR  (22).  The  core  genome  of  each  primary  clade  was  aligned  against 

158  all  surveyed  genomes  (n=1 , 1 30)  to  identify  species  or  clade  specific  markers.  A 

159  marker  was  determined  to  be  clade  specific  if  it  had  a  BSR  value  >  0.8  in  all 

160  target  genomes  and  <  0.4  in  all  non-target  genomes;  although  this  definition  is 

161  very  conservative,  it  was  used  to  identify  discriminatory  markers,  regardless  of 

162  genome  assembly  quality.  The  results  demonstrate  that  species-specific  markers 

163  were  identified  for  most  of  the  major  clades  (Table  2);  a  multi-FASTA  of  all 

164  species-specific  coding  regions  is  publically  available 

165  (https://gist.github.com/jasonsahl/3e4132ca1d09b717fcc2).  A  screen  of  these 

166  species-specific  markers  against  all  genomes  was  visualized  to  demonstrate 

167  their  specificity  to  each  targeted  clade  (Figure  2).  The  stability  of  markers  from 

168  clades  with  a  limited  number  of  representatives  is  unknown  and  will  need  to  be 

169  validated  with  additional  sequencing.  Markers  also  were  identified  for  the  B. 

170  cepacia  complex  ( Bcc )  and  the  B.  pseudomallei  complex  (Bpc),  which  can  help 

171  to  verify  results  obtained  through  diagnostic  sequencing  efforts. 

172 

173  Putative  new  species.  Based  on  the  phylogeny  (Figure  1),  five  divergent  clades 

174  were  identified  that  may  represent  novel  species  (PS-1  through  PS-5).  We  have 

175  generated  completed  or  nearly  completed  genomes  for  at  least  one  isolate  from 

176  each  of  these  clades  (Supplemental  Table  1).  A  BLASTN  alignment  of  the 

177  extracted  recA  sequences  against  the  GenBank  nucleotide  database  failed  to 

178  identify  a  close  match  to  a  named  species  for  any  of  these  clades.  To 

179  demonstrate  the  differences  between  genomes  in  these  putative  species,  one 

180  representative  was  compared  against  a  genome  of  the  nearest  species,  based 

181  on  closest  patristic  distance,  or  tree  path  length  distance,  to  the  nearest 

182  monophyletic  clade  in  the  global  phylogeny  (Figure  1).  For  each  pairwise 

183  comparison,  the  ANI  and  DDH  values  were  calculated  and  tabulated  (Table  3). 

184  The  results  demonstrate  that  many  of  the  clades  have  ANI  values  <  95% 
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185  compared  to  the  nearest  reference  genome  based  on  its  position  in  the 

186  phylogeny.  Putative  species  (PS)-2,  which  is  most  closely  related  to  B. 

187  oklahomensis,  demonstrated  AN  I  values  on  the  border  of  the  species  threshold 

188  when  compared  to  B.  oklahomensis  genomes.  All  of  the  genomes  from  PS-2 

189  have  been  isolated  from  Australia  whereas  all  B.  oklahomensis,  including  the  two 

190  publicly  available  genomes,  have  been  isolated  from  the  United  States 

191  (Supplemental  Table  1).  This  physical  separation,  combined  with  the  borderline 

192  ANI  values,  may  argue  for  separate  species,  but  biochemical  testing  is  required 

193  to  bolster  this  separation  and  is  currently  ongoing. 

194 

195  Core  genome  size  reduction  with  additional  sampling  and  signal  erosion 

196  with  the  inclusion  of  near  neighbor  genomes.  In  bacteria  with  highly  plastic 

197  genomes,  the  inclusion  of  additional  isolates  can  cause  the  core  genome  size  to 

198  decrease  (3).  To  demonstrate  this  effect  in  Burkholderia,  we  calculated  pan- 

199  genome  statistics  on  416  B.  pseudomallei/mallei  genomes.  The  results 

200  demonstrate  that  as  additional  genomes  are  added  to  the  analysis,  the  core 

201  genome  size  reduces  to  1,684  CDSs;  annotation  of  these  CDSs  is  provided  in 

202  Supplementary  Table  3.  This  analysis  includes  genomes  from  B.  mallei,  which 

203  has  undergone  significant  evolutionary  decay  (9),  and  isolates  from  a  chronic  B. 

204  pseudomallei  infection  that  have  also  undergone  substantial  genome  reduction 

205  over  time  due  to  long-term  host  adaptation  (23).  By  inclusion  of  a  diverse  set  of 

206  genomes,  the  minimum  set  of  genes  required  by  all  B.  pseudomallei/mallei  can 

207  be  identified.  From  randomly  subsampling  the  416  genomes  at  different  genome 

208  levels,  the  sampling  effect  on  the  core  genome  size  was  visualized  (Figure  3a). 

209  In  addition  to  core  genome  size  reduction,  the  effect  of  including  additional 

210  near  neighbor  genomes  on  accurate  diagnostics  was  also  investigated.  The  core 

211  genome  from  all  B.  pseudomallei/mallei  genomes  was  aligned  using  LS-BSR 

212  against  a  randomly  selected  subset  of  near  neighbor  genomes  ranging  from  10 

213  to  300  genomes,  with  each  iteration  performed  100  times.  By  the  time  that  300 

214  near  neighbor  genomes  are  randomly  selected,  the  number  of  B. 

215  pseudomallei/mallei  markers  converged  on  the  same  number  that  is  obtained 
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216  using  the  entire  set  of  714  near  neighbor  genomes  (Figure  3b).  This  result 

217  demonstrates  that  a  significant  number  of  near-neighbor  genomes  must  be 

218  sequenced  in  order  to  identify  a  set  of  molecular  markers  that  are  highly 

219  discriminatory  for  a  given  clade. 

220  When  we  considered  the  416  B.  pseudomallei/mallei  genomes,  a  surprisingly 

221  small  number  of  unique  markers  (n=38)  were  identified  (Supplemental  Table  3). 

222  Of  these  markers,  one  in  particular  (BPSS0060;  encodes  a  hypothetical  protein) 

223  only  contained  3  polymorphisms  across  all  of  the  diverse  B.  pseudomallei/mallei 

224  isolates  in  our  study.  This  gene  represents  a  highly  specific  diagnostic  target 

225  under  low  selection  for  mutation.  If  only  B.  pseudomallei  is  considered,  22 

226  conserved  markers  were  identified  (Supplemental  Table  4)  in  B.  pseudomallei 

227  that  are  missing  from  B.  mallei  and  all  other  considered  Burkholderia  genomes 

228  (Figure  2). 

229  If  we  only  consider  the  publicly  available  genomes  used  in  this  study  (n=298), 

230  the  core  genome  size  for  B.  pseudomallei/mallei  is  2,570  CDSs.  When  this  core 

231  genome  was  screened  against  other  Burkholderia  near-neighbor  genomes 

232  available  in  GenBank  (n=141),  63  markers  were  identified  that  were  unique  to  B. 

233  pseudomallei/mallei.  In  contrast,  if  near-neighbor  genomes  sequenced  in  the 

234  study  were  also  included  (n=573),  51  markers  were  identified.  By  not  including 

235  additional  non-target  reference  genomes,  thirteen  of  these  markers  would 

236  represent  false  positives  in  screening  studies.  If  only  target  genomes  in  GenBank 

237  are  considered,  25  false  positives  would  be  identified,  demonstrating  the  need  to 

238  include  large  numbers  of  target  and  non-target  genomes. 

239 

240  Discussion 

241  Accurate  design  of  highly  specific  diagnostics  is  important  for  the  detection  of 

242  dangerous  human  pathogens  in  both  environmental  and  clinical  settings.  Timely 

243  pathogen  identification  directly  from  clinical  specimens  could  inform  the  early 

244  treatment  of  potentially  deadly  infections.  However,  as  our  study  demonstrates, 

245  the  genomic  targets  of  molecular  assays  need  to  first  be  thoroughly  validated  to 

246  avoid  false  positives  and  false  negatives,  which  can  potentially  confound 
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247  diagnostic  tests  and  delay  appropriate  patient  treatment.  In  this  study,  we 

248  highlight  the  importance  of  exploring  the  strict  core  genome  size  and  signature 

249  erosion  before  designing  diagnostic  PCR  targets.  Our  genome-based  approach 

250  is  applicable  for  other  researchers  who  wish  to  develop  diagnostic  assays  for 

251  other  pathogens. 

252  The  effect  of  signature  erosion  and  core  genome  size  reduction  was 

253  highlighted  in  the  genus,  Burkholderia.  To  characterize  the  genomic  space  within 

254  B.  pseudomallei/mallei,  as  well  as  in  closely  related  genomes,  we  sequenced 

255  829  Burkholderia  genomes  from  diverse  locations.  A  large-scale  comparative 

256  genomics  analysis  of  these  genomes  demonstrated  that  specific  molecular 

257  markers  were  identified  for  many  of  the  major  Burkholderia  clades  identified  from 

258  the  core  genome  SNP  phylogeny  (Figure  2).  These  unique  coding  regions  were 

259  likely  acquired  horizontally,  based  on  the  lack  of  homology  of  these  regions  in 

260  other  lineages  within  the  genus  (Figure  2).  To  demonstrate  the  need  to  sequence 

261  a  large  collection  of  genomes  to  identify  specific  diagnostic  targets,  a  core 

262  genome  reduction  analysis  was  performed  (Figure  3a).  This  analysis 

263  demonstrated  that  sequencing  additional  genomes  causes  the  core  genome  size 

264  to  decline.  This  analysis  was  performed  by  including  a  large  number  of  draft 

265  genome  assemblies,  which  may  cause  genomic  elements  to  be  either  truncated, 

266  based  on  unresolvable  repeats,  or  missing  altogether,  based  on  either  insufficient 

267  coverage  or  assembly  algorithms  that  remove  either  short  contigs  or  regions  of 

268  anomalous  coverage.  Based  on  the  genome  panel  used  in  this  analysis, 

269  including  a  large  and  diverse  set  of  isolates  is  important  to  avoid  selecting 

270  potential  diagnostic  targets  that  are  susceptible  to  false  negative  results  when 

271  screening  either  clinical  or  environmental  samples.  If  only  genomes  in  public 

272  databases  were  selected,  multiple  markers  would  be  identified  that  represent 

273  potential  false  negatives. 

274  The  other  important  factor  to  consider  when  designing  diagnostic  markers  is 

275  the  effect  of  signature  erosion  that  can  be  introduced  due  to  the  inclusion  of  close 

276  relatives  to  the  clade  of  interest.  If  only  genomes  available  in  GenBank  were 

277  included  in  the  analysis,  63  markers  were  identified  that  appeared  to  be  specific 
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278  to  B.  pseudomallei/mallei.  However,  if  all  non  pseudomallei/mallei  genomes  were 

279  included  in  the  analysis,  only  51  B.  pseudomallei/mallei  markers  were  identified, 

280  demonstrating  the  impact  of  including  a  comprehensive  set  of  genomes  outside 

281  of  the  targeted  species  or  clade.  If  all  genomes  from  our  study  were  included, 

282  only  38  B.  pseudomallei/mallei  specific  markers  were  identified,  which 

283  demonstrates  the  need  to  include  diverse  genomes  from  both  the  targeted  clade 

284  as  well  as  from  clades  closely  related  to  the  targeted  clade. 

285  This  study  both  expands  the  known  genomic  diversity  of  the  Burkholderia 

286  genus  and  also  provides  a  framework  for  using  genomic  data  to  design  highly 

287  specific  diagnostic  targets.  For  some  species,  near  neighbor  genomes  are  not 

288  available  or  difficult  to  isolate,  which  complicates  the  identification  of  these 

289  targets,  and  highlights  the  need  for  continued  genome  sequencing.  The  reported 

290  sampling  effects  on  strict  core  genome  size  and  signature  erosion  must  be 

291  considered  when  interpreting  surveillance  results  for  human  pathogens. 

292 

293  Materials  and  Methods 

294 

295  Isolate  collection,  DNA  extraction,  genome  sequencing,  assembly. 

296  Burkholderia  isolates  were  collected  from  diverse  global  locations  with  a  focus  on 

297  highly  endemic  regions  for  B.  pseudomallei,  including  northern  Australia  and 

298  northeastern  Thailand  (Supplemental  Table  1).  Isolates  were  collected  by  the 

299  Menzies  School  of  Health  and  Research,  Northern  Arizona  University,  the 

300  University  of  Michigan,  the  James  Cook  University,  Mahidol  University,  and  the 

301  US  Army  Medical  Research  Unit  (USAMRU).  All  final  culture  and  DNA  extraction 

302  procedures  were  performed  at  Northern  Arizona  University,  and  whole  genome 

303  sequencing  (WGS)  was  performed  at  the  Translational  Genomics  Research 

304  Institute  (TGen;  lllumina)  and  the  U.S.  Army  Medical  Research  Institute  of 

305  Infectious  Diseases  (USAMRIID;  PacBio). 

306  Isolates  initially  grown  on  Ashdown’s  agar  were  streaked  from  a  single  purified 

307  colony  to  form  a  lawn  and  then  stored  at  -80°C  in  Luria  Bertani  (LB)  broth  with 

308  20%  glycerol.  Cultures  were  grown  on  LB  agar  plates  and  incubated  at  37°C  for 


10 


UNCLASSIFIED 


TR-16-190  DISTRIBUTION  STATEMENT  A:  Approved  for  public  release;  distribution  is  unlimited. 


309  24-48  hours.  High  molecular  weight  DNA  was  extracted  using  the  Qiagen® 

310  DNeasy  Blood  and  Tissue  Kit  (catalog  no.  69504;  Valencia,  CA)  for  whole 

311  genome  sequencing  on  lllumina  (lllumina,  Inc.;  San  Diego,  CA)  and  Pacific 

312  Biosciences  (Menlo  Park,  CA)  platforms.  Using  approximately  2.7pg  of  gDNA, 

313  libraries  were  prepared  for  lllumina  whole  genome  sequencing  as  previously 

314  described  (24). 

315  DNA  was  sequenced  on  multiple  platforms,  including  lllumina  HiSeq  2000, 

316  lllumina  MiSeq,  and  PacBio.  Raw  lllumina  reads  were  assembled  with  SPAdes 

317  v3.5.0  (25)  in  conjunction  with  a  pipeline  developed  to  identify  sequence 

318  contamination  between  multiplexed  samples 

319  (https://github.com/jasonsahl/UGAP).  Contigs  that  showed  an  anomalously  low 

320  depth  of  coverage  compared  to  other  contigs  from  the  same  assembly,  or  those 

321  that  aligned  to  other  organisms  multiplexed  in  the  same  lane,  were  manually 

322  removed.  Genome  assembly  information  is  shown  in  Supplemental  Table  1 . 

323  For  PacBio  assemblies,  genomic  DNA  was  sheared  to  20kb  average  size 

324  using  g-TUBEs  (Covaris  inc.).  After  DNA  damage  repair  and  end  repair,  hairpin 

325  adapters  were  ligated  to  form  a  SMRTbell  template.  Exolll  and  ExoVII  treatment 

326  was  used  to  remove  failed  ligation  products.  Size  selection  was  performed  on  the 

327  Blue  Pippin  system  (Sage  Sciences)  using  0.75%  dye-free  agarose  gel  cassette, 

328  marker  SI  and  Hi-Pass  protocol;  low  cut  was  set  on  4000  bp.  Final  library 

329  assessment  was  obtained  by  Cubit  dsDNA  BR  assay  and  Agilent  2100 

330  Bioanalyzer  DNA  12000  chip  analyses.  Annealing  of  sequencing  primer  and 

331  binding  polymerase  P4  to  the  SMRTbell  template  was  performed  according  to 

332  the  PacBio  calculator.  The  polymerase-template  complexes  were  bound  to 

333  MagBeads,  loaded  onto  SMRTcells  at  final  concentration  180  pM,  and 

334  sequenced  with  180  min  movies  on  the  PacBio  RS  II  instrument. 

335  PacBio  sequences  were  assembled  de  novo  using  the  Hierarchical  Genome 

336  Assembly  Pipeline  (HGAP)  (26).  Draft  assemblies  were  checked  for  overlapping 

337  ends  using  Gepard  (27)  and  BLAST  (28).  Overlapping  ends  are  typical  of  long- 

338  read  assemblies  of  circular  chromosomes.  Redundant  end  sequences  were 
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339  trimmed  to  one  copy  and  the  genome  was  rotated  to  create  a  new  breakpoint. 

340  Reads  were  then  re-aligned  to  the  trimmed  and  shifted  draft  assembly  for 

341  correction  using  the  Quiver  algorithm.  Contigs  that  did  not  have  identifiable 

342  homologous  ends  were  corrected  using  Quiver  without  further  processing. 

343 

344  Species  identification  using  core  genome  single  nucleotide  polymorphism 

345  (SNP)  phylogeny,  average  nucleotide  identity  (ANI),  and  digital  DNA  DNA 

346  hybridization  (DDH)  calculation.  To  model  the  evolutionary  relationships 

347  between  Burkholderia  spp.,  a  set  of  reference  genomes  (Supplemental  Table  2) 

348  was  downloaded  from  GenBank  (29)  and  combined  with  the  genomes 

349  sequenced  in  this  study.  For  a  number  of  these  genomes,  only  raw  reads  were 

350  available,  which  were  assembled  for  use  in  the  comparative  analyses  described 

351  below.  All  genomes  were  aligned  against  the  reference  genome  of  B. 

352  pseudomallei  K96243  (19)  using  NUCmer  (20).  Regions  that  aligned  more  than 

353  once  by  a  reference  self-alignment  (i.e.  duplicated  regions)  were  removed  from 

354  downstream  analyses.  All  SNP-based  methods  were  wrapped  by  the  Northern 

355  Arizona  SNP  Pipeline  (NASP)  (http://tqennorth.qithub.io/NASP/)  (30). 

356  Orthologous  SNPs  conserved  in  all  genomes  were  concatenated  and  a 

357  maximum  likelihood  phylogeny  was  inferred  with  RAxML  v8  (31)  using  the 

358  ASC_GTRGAMMA  substitution  model  and  Lewis  correction  (32). 

359  For  determining  species  differences,  the  average  nucleotide  identity  (ANI) 

360  was  calculated  with  default  values  in  JSpecies  (33).  JSpecies  calculates  ANIb, 

361  which  uses  BLASTN  alignments  (28),  or  ANIm,  which  uses  NUCmer  alignments. 

362  The  average  values  were  reported  over  the  entire  length  of  all  alignments.  To 

363  find  the  nearest  neighbor  to  which  to  query  target  genomes,  the  closest  patristic 

364  distances  were  chosen,  as  calculated  by  DendroPy  (34).  Digital  DNA-DNA 

365  hybridization  (DDH)  values  were  calculated  with  a  web  service  (ggdc.dsmz.de) 

366  (35)  and  the  range  of  reported  values  was  reported. 

367 

368  Identifying  B.  pseudomallei  and  B.  mallei  markers  for  diagnostics  using 

369  comparative  genomics  and  pan-genome  analysis.  Coding  DNA  sequences 
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370  (CDSs)  were  identified  for  each  species  with  Prodigal  (36)  and  were  de- 

371  replicated  with  USEARCH  (37).  Each  representative  CDS  was  then  aligned 

372  against  each  genome  with  BLAT  (38)  and  the  Blast  Score  Ratio  (BSR)  (39)  value 

373  was  calculated;  these  methods  were  all  wrapped  by  the  Large-Scale  Blast  Score 

374  Ratio  (LS-BSR)  pipeline  (22).  LS-BSR  was  performed  for  each  species  and  the 

375  number  of  core  CDSs  (BSR  value  >  0.8  in  all  genomes)  in  each  group  was 

376  calculated;  a  BSR  value  of  0.8  is  roughly  equivalent  to  80%  protein  identity  over 

377  100%  of  the  length  of  the  protein  (3).  These  core  CDSs  from  a  given  species  or 

378  clade  were  then  screened  against  all  other  genomes,  and  those  genes  with  a 

379  BSR  value  <  0.4  in  all  other  species  were  identified  as  suitable  species 

380  diagnostic  markers. 

381  The  pan-genome  was  calculated  for  each  clade  using  LS-BSR  in  conjunction 

382  with  BLAT.  A  CDS  was  determined  to  belong  to  the  core  genome  if  it  had  a  BSR 

383  value  >  0.8  in  all  genomes  queried  for  a  given  species  or  clade  of  interest.  Each 

384  core  CDS  was  then  screened  against  all  genomes  in  the  analysis  with  LS-BSR.  A 

385  CDS  was  determined  to  be  species  specific  if  it  was  in  the  core  genome  of  the 

386  species  or  clade  of  interest  and  missing  or  highly  divergent  (BSR  <  0.4)  in  all 

387  other  Burkholderia  genomes. 

388 

389  Core  genome  size  reduction  and  signal  erosion.  To  understand  the  sampling 

390  effect  on  the  core  genome  size  in  B.  pseudomallei/mallei,  a  set  of  416  B. 

391  pseudomallei/mallei  genomes  was  sampled  without  replacement  from  1  to  400, 

392  with  100  iterations  at  each  level.  From  each  sub-sampling,  a  set  number  of 

393  genomes  were  randomly  selected  with  a  python  script 

394  (https://gist.github.com/990d2c56c23bb5c2909d.git)  and  the  core  genome  (CDSs 

395  with  a  BSR  value  of  >  0.8  in  all  genomes)  was  calculated  and  plotted.  B. 

396  pseudomallei  and  B.  mallei  were  treated  as  a  single  species  for  this  and  many  of 

397  the  subsequent  analyses  as  B.  mallei  is  recognized  as  an  equine-adapted  clone 

398  within  B.  pseudomallei  (40). 

399  To  understand  the  erosion  of  B.  pseudomallei/mallei  specific  targets  with  the 

400  inclusion  of  sequences  from  other  Burkholderia  spp.,  the  core  genome  (n=1,684 
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401  CDSs)  from  a  set  of  416  B.  pseudomallei/mallei  genomes  was  used.  All 

402  Burkholderia  near  neighbor  genomes  (n=714)  were  then  randomly  sampled 

403  without  replacement  at  different  levels  from  1  to  300.  The  B.  pseudomallei/mallei 

404  core  genome  was  then  aligned  against  these  near  neighbor  genomes  to  identify 

405  core  regions  present  in  other  Burkholderia  species,  and  the  number  of  CDSs  with 

406  a  BSR  value  <  0.4  in  all  near  neighbor  genomes,  indicating  missing  genes,  was 

407  calculated  and  plotted. 

408 

409  Data  Availability.  Sequence  data  was  submitted  to  the  Sequence  Read  Archive 

410  for  each  isolate.  Furthermore,  genome  assemblies  for  all  isolates  were  submitted 

411  to  NCBI.  Individual  accession  numbers  are  show  in  Supplemental  Table  1  and  all 

412  data  is  deposited  under  PRJNA285704  and  PRJNA279182. 
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425 

426  Figure  legends 

427 

428  Figure  1.  A  core  genome  single  nucleotide  polymorphism  (SNP)  phylogeny  of 

429  Burkholderia  genomes.  All  SNPs  were  identified  by  aligning  genome  assemblies 

430  against  the  finished  genome  of  B.  pseudomallei  K96243  (19)  with  NUCmer  (20) 

431  and  processed  with  the  Northern  Arizona  SNP  pipeline 
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432  (http://tqennorth.qithub.io/NASP/)  (30).  A  maximum  likelihood  phylogeny  was 

433  inferred  on  the  concatenated  SNP  alignment  with  RAxML  v8  (31)  with  100 

434  bootstrap  replicates.  Clades  were  collapsed  with  ARB  (41).  Putative  novel 

435  species  are  named  PS  (putative  species)  and  the  clade  number. 

436 

437  Figure  2.  A  core  genome  single  nucleotide  polymorphism  (SNP)  phylogeny 

438  associated  with  a  heatmap  of  markers  unique  to  specific  clades.  The  core 

439  genome  phylogeny  was  inferred  with  RAxML  (31)  on  a  concatenated  SNP 

440  alignment  produced  by  aligning  1130  genomes  against  the  finished  genome  of  B. 

441  pseudomallei  K96243  (19)  with  NUCmer  (20)  in  conjunction  with  NASP 

442  (http://tqennorth.qithub.io/NASP/).  Coding  regions  unique  to  specific  clades  were 

443  aligned  against  all  genomes  with  LS-BSR  (22)  and  the  heatmap  was  visualized 

444  with  the  interactive  tree  of  life  (42).  The  heatmap  demonstrates  the  distribution  of 

445  identified  markers  against  all  genomes  screened  in  this  study. 

446 

447  Figure  3.  (A)  Core  genome  reduction  in  Burkholderia  pseudomallei/mallei.  The 

448  core  genome  was  calculated  with  the  LS-BSR  pipeline  (22)  on  416  genomes.  For 

449  sub-sampling,  genomes  were  randomly  selected  at  different  depths  and  the 

450  number  of  coding  regions  (CDSs)  with  a  blast  score  ratio  (BSR)  (39)  value  >  0.8 

451  in  all  genomes  was  calculated  and  plotted.  For  each  sub-sampling  level,  100 

452  iterations  were  performed.  The  mean  value  at  each  level  is  shown  in  red  and 

453  each  replicate  is  shown  in  black.  (B)  The  effect  of  signature  erosion  on  the 

454  design  of  B.  pseudomallei/mallei  diagnostic  markers.  Genomes  outside  of  the  B. 

455  pseudomallei/mallei  clade  (n=714)  were  randomly  selected  at  different  depths. 

456  The  core  genome  of  416  B.  pseudomallei/mallei  genomes  was  screened  against 

457  non  pseudomallei/mallei  genomes  with  LS-BSR  (22)  and  the  number  of  markers 

458  with  a  BSR  value  <  0.4  in  non  pseudomallei/mallei  genomes  was  calculated  and 

459  plotted.  One  hundred  independent  replicates  were  processed  at  each  sampling 

460  depth.  The  mean  value  at  each  level  is  shown  in  red  and  each  replicate  is  shown 

461  in  black. 

462 
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463 

464 

465 

466 

467 

468 

469 

470 


Table  1  .Summary  of  new  genomes 
sequenced  as  part  of  this  study 


clade 

#genomes 

B.  anthina 

8 

B.  cenocepacia  1 

1 

B.  cenocepacia  2 

4 

B.  cepacia 

78 

B.  diffusa 

12 

B.  gladioli 

1 

B.  humptydooensis 

5 

B.  lata 

2 

B.  latens 

2 

B.  metallica 

1 

B.  multivorans 

14 

B.  oklahomensis 

2 

putative  species  1 

3 

putative  species  2 

4 

putative  species  3 

10 

putative  species  4 

7 

putative  species  5 

8 

B.  pseudomallei 

256 

B.  pseudomultivorans 

9 

B.  pyrrocinia 

1 

B.  seminalis 

2 

B.  stagnalis 

67 

B.  thailandensis 

8 

B.  territorii 

33 

B.  ubonensis 

254 

B.  vietnamiensis 

37 

total 

829 
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472 

473 

474 

475 

476 

477 

478 

479 

Table  2.  Core  genome 
stats 


Species/clade 

core 

genome 

size 

#genomes 

#species/clade 

specific 

markers 

ambifaria 

5408 

2 

71 

anthina 

5507 

8 

13 

cenocepacia- 1 

3823 

8 

8 

cenocepacia- 2 

5076 

16 

22 

cepacia 

4415 

83 

7 

diffusa 

4566 

12 

7 

dolosa 

5451 

3 

436 

gladioli 

4898 

6 

833 

glumae 

3253 

3 

264 

humptydooensis 

5115 

7 

157 

lata 

4214 

7 

0 

latens 

5348 

3 

105 

multivorans 

4001 

21 

53 

oklahomensis 

5681 

4 

141 

PS-1 

3693 

3 

504 

PS-2 

4231 

4 

23 

PS-3 

5047 

11 

195 

PS-4 

4366 

7 

0 

PS-5 

4978 

8 

0 

pseudomallei 

2339 

392 

22 

pseudomallei/mallei 

1690 

416 

38 

pseudomultivorans 

4549 

10 

62 

pyrrocinia 

6397 

4 

153 

seminalis 

6533 

2 

90 

stagnalis 

4835 

67 

54 

thailandensis 

4447 

20 

116 

17 
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territorii 

4399 

33 

0 

ubonensis 

3128 

255 

40 

vietnamiensis 

3803 

40 

71 

18 
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480 

Table  3.  Average  nucleotide  identity  (ANI)  and  DNA  DNA  hybridization  (DDH)  values  between  representatives 
of  putative  novel  species  and  representatives  of  established  clades 


Genome 

clade 

nearest  genome 

ANIm 

(%) 

ANIb 

(%) 

DDH  (%) 

MSMB175 

Putative  species  1 

B.  gladioli  BSR3 

85.5 

79.8 

18.7-23.7 

BDU8 

Putative  species  2 

B.  oklahomensis  C6786 

94.9 

94.8 

59.3-75.8 

MSMB0852 

Putative  species  3 

B.  sp.  MSMB43 

92.4 

91.1 

44.5-52.7 

MSMB0856 

Putative  species  4 

B.  pyrrocinia  Lyc  2 

91.2 

89.8 

44.9-60.8 

NRF60-BP8 

Putative  species  5 

B.  cenocepacia  KC-01 

94.1 

93.5 

54.5-56.9 
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492 


— I  SB  glumae  (n=3) 
rPS-1  (n=3) 

-0  8  oklahomensis  (n=4) 

PS-2  (n=4) 

-TPS-3  (n=11 ) 

■O B  humptydooensis  (n=7) 

- □  B  thailandensis  (n=20) 

1 

B  pseudomallet/mallet  (n=416) 


-B  spp-  lig30 

O'7 B .  pseudomultivorans  (n=10) 

-OB  dolosa  (n=3) 

B  multivarans  (n=21) 

{78-  latens  (n=3) 

C78  -  vietnamiensis  (n=40) 

C778  ambifaria  (n=2) 

S  cepacia  GG4 
8  spp  A9 
8  spp  USM-B20 
S-4  (n=7) 

B  diffusa  (n=12) 

8  territorii  (n=33) 

08  seminalis  ( n  ~  2 ) 

DS  anthina  (n=8) 

8  cepacia  Bu72 
PS-5  (n=8) 

8  cenocepacia  CEIB  S5-1 
8  cenocepacia  1  (n=8) 

£7  B  cenocepacia  2  (n=1 6) 

8  cepacia  LK29 
pyrrocinia  DSM  10685 
pyrrocinia  Lyc2 
-B  spp  MShl 
B  spp  MSh2 
Bp7035,  8.  metallica 
7 8  lata  (n=7) 


'8  cepacia  (n=83) 


D  no  genomes  from  this  study 


1 


□  includes  genomes  from  this  study 
putative  new  species 
O  Boostrap  support  >  95% 


—08  pyrrocinia  (n=4) 

-  stagnalis  (n=67) 


8-  ubonensis  (n=255) 


B.  pseudomalle > 
complex  (Bpc) 


8.  cepacia 
complex  ( Bcc ) 
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